
    A critical look at studies applying over-sampling on the TPEHGDB dataset

    Preterm birth is the leading cause of death among young children and has a large prevalence globally. Machine learning models, based on features extracted from clinical sources such as electronic patient files, yield promising results. In this study, we review similar studies that constructed predictive models based on a publicly available dataset, called the Term-Preterm EHG Database (TPEHGDB), which contains electrohysterogram signals on top of clinical data. These studies often report near-perfect prediction results, by applying over-sampling as a means of data augmentation. We reconstruct these results to show that they can only be achieved when data augmentation is applied on the entire dataset prior to partitioning into training and testing sets. This results in (i) samples that are highly correlated to data points from the test set being added to the training set, and (ii) artificial samples that are highly correlated to points from the training set being added to the test set. Many previously reported results therefore carry little meaning in terms of the actual effectiveness of the model in making predictions on unseen data in a real-world setting. After focusing on the danger of applying over-sampling strategies before data partitioning, we present a realistic baseline for the TPEHGDB dataset and show how the predictive performance and clinical use can be improved by incorporating features from electrohysterogram sensors and by applying over-sampling on the training set only.
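
    The leakage described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes plain random duplication as a stand-in for the over-sampling methods reviewed, uses integer ids in place of feature vectors, and all function names are hypothetical. It counts how many test rows have an exact copy in the training set under each ordering.

```python
import random

def make_dataset(n_majority=90, n_minority=10):
    # Each row is (sample_id, label); the id stands in for a feature vector.
    rows = [(i, 0) for i in range(n_majority)]
    rows += [(n_majority + i, 1) for i in range(n_minority)]
    return rows

def oversample(rows, rng):
    """Duplicate random minority rows until both classes have equal size.

    Assumption: simple duplication, not a synthetic method such as SMOTE.
    """
    majority = [r for r in rows if r[1] == 0]
    minority = [r for r in rows if r[1] == 1]
    if not minority or len(minority) >= len(majority):
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra

def split(rows, rng, test_fraction=0.3):
    """Shuffle and split into (train, test)."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def leaked(train, test):
    """Number of test rows whose id also occurs in the training set."""
    train_ids = {sample_id for sample_id, _ in train}
    return sum(1 for sample_id, _ in test if sample_id in train_ids)

def leak_when_oversampling_first(seed=0):
    rng = random.Random(seed)
    train, test = split(oversample(make_dataset(), rng), rng)
    return leaked(train, test)

def leak_when_splitting_first(seed=0):
    rng = random.Random(seed)
    train, test = split(make_dataset(), rng)
    train = oversample(train, rng)  # only the training set is augmented
    return leaked(train, test)

if __name__ == "__main__":
    print("oversample, then split:", leak_when_oversampling_first())
    print("split, then oversample:", leak_when_splitting_first())
```

    Splitting first always gives zero shared ids, because the test rows are set aside before any copies are made. Over-sampling first scatters copies of the same minority rows across both sets.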

    On the suitability of resampling techniques for the class imbalance problem in credit scoring

    In real-life credit scoring applications, the class of defaulters is very commonly under-represented in comparison with the class of non-defaulters, yet this situation has received little attention. The present paper investigates the suitability and performance of several resampling techniques when applied in conjunction with statistical and artificial intelligence prediction models over five real-world credit data sets, which have been artificially modified to derive different imbalance ratios (proportion of defaulter and non-defaulter examples). Experimental results demonstrate that the use of resampling methods consistently improves the performance given by the original imbalanced data. It is also important to note that, in general, over-sampling techniques perform better than any under-sampling approach. This work has partially been supported by the Spanish Ministry of Education and Science under grant TIN2009-14205 and the Generalitat Valenciana under grant PROMETEO/2010/028.
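
    As a rough illustration of the setup this abstract describes, the sketch below derives a chosen imbalance ratio from a labelled data set and then applies random under-sampling, the simplest resampling technique. The abstract does not specify the procedure, so the function names, the 0/1 label coding (1 = defaulter) and the toy class sizes are all assumptions.

```python
import random

def with_imbalance_ratio(labels, ratio, rng):
    """Indices that keep every non-defaulter (0) and a random subset of
    defaulters (1), giving about `ratio` non-defaulters per defaulter."""
    non_defaulters = [i for i, y in enumerate(labels) if y == 0]
    defaulters = [i for i, y in enumerate(labels) if y == 1]
    n_keep = min(len(defaulters), max(1, len(non_defaulters) // ratio))
    return sorted(non_defaulters + rng.sample(defaulters, n_keep))

def random_undersample(labels, rng):
    """Indices with the larger class cut down to the size of the smaller one."""
    by_class = {0: [], 1: []}
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n = min(len(by_class[0]), len(by_class[1]))
    return sorted(rng.sample(by_class[0], n) + rng.sample(by_class[1], n))

rng = random.Random(0)
labels = [0] * 700 + [1] * 300             # toy data: 700 good, 300 bad loans
kept = with_imbalance_ratio(labels, 10, rng)
imbalanced = [labels[i] for i in kept]     # 700 non-defaulters, 70 defaulters
balanced = random_undersample(imbalanced, rng)
```

    Here `kept` indexes into `labels`, while `balanced` indexes into the already reduced list `imbalanced`.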

    A swarm intelligence approach in undersampling majority class

    Over the years, machine learning has faced the issue of imbalanced datasets, which occurs when the number of instances in one class significantly outnumbers the instances in the other class. This study investigates a new approach for balancing the dataset using a swarm intelligence technique, Stochastic Diffusion Search (SDS), to undersample the majority class on a direct marketing dataset. The outcome of this novel application of a swarm intelligence algorithm demonstrates promising results, which encourage the possibility of undersampling a majority class by removing redundant data whilst protecting the useful data in the dataset. This paper details the behaviour of the proposed algorithm in dealing with this problem and investigates the results, which are contrasted against other techniques.

    Maximal regularity for non-autonomous equations with measurable dependence on time

    In this paper we study maximal $L^p$-regularity for evolution equations with time-dependent operators $A$. We merely assume a measurable dependence on time. In the first part of the paper we present a new sufficient condition for the $L^p$-boundedness of a class of vector-valued singular integrals which does not rely on Hörmander conditions in the time variable. This is then used to develop an abstract operator-theoretic approach to maximal regularity. The results are applied to the case of $m$-th order elliptic operators $A$ with time and space-dependent coefficients. Here the highest order coefficients are assumed to be measurable in time and continuous in the space variables. This results in an $L^p(L^q)$-theory for such equations for $p, q \in (1, \infty)$. In the final section we extend a well-posedness result for quasilinear equations to the time-dependent setting. Here we give an example of a nonlinear parabolic PDE to which the result can be applied. Comment: Application to a quasilinear equation added. Accepted for publication in Potential Analysis.

    A prevalent mutation with founder effect in Spanish Recessive Dystrophic Epidermolysis Bullosa families

    Background: Recessive Dystrophic Epidermolysis Bullosa (RDEB) is a genodermatosis caused by more than 500 different mutations in the COL7A1 gene and characterized by blistering of the skin following minimal friction or mechanical trauma. The identification of a cluster of RDEB pedigrees carrying the c.6527insC mutation in a specific area raises the question of whether this mutation originated from a common ancestor or is the result of a hotspot mutation. The aim of this study was to investigate the origin of the c.6527insC mutation. Methods: Haplotypes were constructed by genotyping nine single nucleotide polymorphisms (SNPs) throughout the COL7A1 gene. Haplotypes were determined in RDEB patients and control samples, both of Spanish origin. Results: Sixteen different haplotypes were identified in our study. A single haplotype cosegregated with the c.6527insC mutation. Conclusion: Haplotype analysis showed that all alleles carrying the c.6527insC mutation shared the same haplotype cosegregating with this mutation (CCGCTCAAA_6527insC), thus suggesting the presence of a common ancestor.

    Molecular Approach to the Identification of Fish in the South China Sea

    BACKGROUND: DNA barcoding is one means of establishing a rapid, accurate, and cost-effective system for the identification of species. It involves the use of short, standard gene targets to create sequence profiles of known species, against which sequences of unknowns can be matched and subsequently identified. The Fish Barcode of Life (FISH-BOL) campaign has the primary goal of gathering DNA barcode records for all the world's fish species. As a contribution to FISH-BOL, we examined the degree to which DNA barcoding can discriminate marine fishes from the South China Sea. METHODOLOGY/PRINCIPAL FINDINGS: DNA barcodes of cytochrome oxidase subunit I (COI) were characterized using 1336 specimens that belong to 242 fish species from the South China Sea. All specimen provenance data (including digital specimen images and geospatial coordinates of collection localities) and collateral sequence information were assembled using the Barcode of Life Data System (BOLD; www.barcodinglife.org). Small intraspecific and large interspecific differences create distinct genetic boundaries among most species. In addition, the efficiency of two mitochondrial genes, 16S rRNA (16S) and cytochrome b (cytb), and one nuclear ribosomal gene, 18S rRNA (18S), was also evaluated for a few select groups of species. CONCLUSIONS/SIGNIFICANCE: The present study provides evidence for the effectiveness of DNA barcoding as a tool for monitoring marine biodiversity. Open access data of fishes from the South China Sea can benefit related applications in ecology and taxonomy.
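
    The "small intraspecific and large interspecific differences" mentioned in this abstract can be made concrete with a toy sketch. It assumes the uncorrected p-distance (fraction of differing sites) rather than whichever distance model the study used, and the species names and 10-base sequences are invented for illustration, not real COI data.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def barcode_gap(specimens):
    """Return (largest intraspecific, smallest interspecific) distance.

    `specimens` is a list of (species_name, aligned_sequence) pairs and must
    contain at least one within-species and one between-species pair.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(specimens, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    return max(intra), min(inter)

# Hypothetical toy alignment: two specimens for each of two species.
specimens = [
    ("species_A", "ACGTACGTAC"),
    ("species_A", "ACGTACGTAT"),
    ("species_B", "TCGAACGGAC"),
    ("species_B", "TCGAACGGAC"),
]
max_intra, min_inter = barcode_gap(specimens)
```

    When the largest within-species distance is below the smallest between-species distance, as in this toy alignment, the species are separated by a clear genetic boundary.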

    Monitoring an Alien Invasion: DNA Barcoding and the Identification of Lionfish and Their Prey on Coral Reefs of the Mexican Caribbean

    BACKGROUND: In the Mexican Caribbean, the exotic lionfish Pterois volitans has become a species of great concern because of its predatory habits and rapid expansion onto the Mesoamerican coral reef, the second largest continuous reef system in the world. This is the first report of DNA identification of stomach contents of lionfish using the barcode of life reference database (BOLD). METHODOLOGY/PRINCIPAL FINDINGS: We confirm with barcoding that only Pterois volitans is apparently present in the Mexican Caribbean. We analyzed the stomach contents of 157 specimens of P. volitans from various locations in the region. Based on DNA matches in the Barcode of Life Database (BOLD) and GenBank, we identified fishes from five orders, 14 families, 22 genera and 34 species in the stomach contents. The families with the most species represented were Gobiidae and Apogonidae. Some prey taxa are commercially important species. Seven species were new records for the Mexican Caribbean: Apogon mosavi, Coryphopterus venezuelae, C. thrix, C. tortugae, Lythrypnus minimus, Starksia langi and S. ocellata. DNA matches, as well as the presence of intact lionfish in the stomach contents, indicate some degree of cannibalism, a behavior confirmed in this species for the first time. We obtained 45 distinct crustacean prey sequences, of which only 20 taxa could be identified from the BOLD and GenBank databases. The matches were primarily to Decapoda, but only a single taxon, Euphausia americana, could be identified to the species level. CONCLUSIONS/SIGNIFICANCE: This technique proved to be an efficient and useful method, especially since prey species could be identified from partially digested remains. The primary limitation is the lack of comprehensive coverage of potential prey species in the region in the BOLD and GenBank databases, especially among invertebrates.

    An insight into imbalanced Big Data classification: outcomes and challenges

    Big Data applications have been emerging in recent years, and researchers from many disciplines are aware of the considerable advantages of extracting knowledge from this type of problem. However, traditional learning approaches cannot be directly applied due to scalability issues. To overcome this issue, the MapReduce framework has arisen as a “de facto” solution. Basically, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way suited to commodity hardware. As this is still a recent discipline, little research has been conducted on imbalanced classification for Big Data. The reasons behind this are mainly the difficulties in adapting standard techniques to the MapReduce programming style. Additionally, intrinsic problems of imbalanced data, namely lack of data and small disjuncts, are accentuated when the data are partitioned to fit the MapReduce programming style. This paper rests on three main pillars. First, it presents the first outcomes for imbalanced classification in Big Data problems, introducing the current state of research in this area. Second, it analyzes the behavior of standard pre-processing techniques in this particular framework. Finally, taking into account the experimental results obtained throughout this work, it discusses the challenges and future directions for the topic. This work has been partially supported by the Spanish Ministry of Science and Technology under Projects TIN2014-57251-P and TIN2015-68454-R, the Andalusian Research Plan P11-TIC-7765, the Foundation BBVA Project 75/2016 BigDaPTOOLS, and the National Science Foundation (NSF) Grant IIS-1447795.
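
    The point about data partitioning accentuating the lack of minority data can be sketched in a few lines. This is a single-process imitation of the map and reduce steps, not code for any real MapReduce framework; the function names and the toy class sizes (980 majority, 20 minority) are assumptions.

```python
import random
from collections import Counter
from functools import reduce

def partition(labels, n_parts):
    """Split the data into contiguous chunks, one per hypothetical mapper."""
    size = -(-len(labels) // n_parts)  # ceiling division
    return [labels[i:i + size] for i in range(0, len(labels), size)]

def map_count(chunk):
    """Map step: each mapper counts the classes in its own chunk."""
    return Counter(chunk)

def reduce_counts(a, b):
    """Reduce step: merge two partial class counts."""
    return a + b

rng = random.Random(0)
labels = [0] * 980 + [1] * 20      # toy data with 2% minority examples
rng.shuffle(labels)

parts = partition(labels, 50)
partial = [map_count(chunk) for chunk in parts]
total = reduce(reduce_counts, partial)

# Mappers that never see a single minority example.
empty_mappers = sum(1 for counts in partial if counts[1] == 0)
```

    The reduced totals are unchanged by partitioning, but with 20 minority examples spread over 50 chunks at least 30 mappers receive none at all, so any model or pre-processing step fitted per chunk has no minority data to work with.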

    Diet and food strategies in a southern al-Andalusian urban environment during Caliphal period, Écija, Sevilla

    The Iberian medieval period is unique in European history due to the widespread socio-cultural changes that took place after the arrival of Arabs, Berbers and Islam in 711 AD. Recently, isotopic research has been insightful on dietary shifts, status, resource availability and the impact of environment. However, there is no published isotopic research exploring these factors in southern Iberian populations, and as the history of this area differs from that of the northern regions, this leaves a significant lacuna in our knowledge. This research fills this gap via isotopic analysis of human (n = 66) and faunal (n = 13) samples from the 9th to the 13th century Écija, a town renowned for high temperatures and salinity. Stable carbon (δ13C) and nitrogen (δ15N) isotopes were assessed from rib collagen, while carbon (δ13C) values were derived from enamel apatite. Human diet is consistent with C3 plant consumption with a very minor contribution of C4 plants, an interesting feature considering the suitability of Écija for C4 cereal production. δ15N values vary among adults, which may suggest variable animal protein consumption or isotopic variation within animal species due to differences in foddering. Consideration of δ13C collagen and apatite values together may indicate sugarcane consumption, while moderate δ15N values do not suggest a strong aridity or salinity effect. Comparison with other Iberian groups shows similarities relating to time and location rather than religion, although more multi-isotopic studies combined with zooarchaeology and botany may reveal subtle differences unobservable in carbon and nitrogen collagen studies alone. OLC is funded by Plan Galego I2C mod.B (ED481D 2017/014). The research was partially funded by the projects “Galician Paleodiet” and by Consiliencia network (ED 431D2017/08) Xunta de Galicia.